movie clips
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
Liu, Jingzhou, Chen, Wenhu, Cheng, Yu, Gan, Zhe, Yu, Licheng, Yang, Yiming, Liu, Jingjing
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. In order to address our new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video). We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.
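The task format described above can be sketched as a small data record plus an accuracy metric. This is only an illustration: the class name `VideoHypothesisPair`, its field names, and the `accuracy` helper are hypothetical and do not come from the Violin release. The arithmetic at the end uses only the figures quoted in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoHypothesisPair:
    """One example: a clip with aligned subtitles (premise) and a hypothesis.

    Field names are assumptions made for this sketch, not the dataset schema.
    """
    clip_id: str
    subtitles: List[str]
    hypothesis: str
    entailed: bool  # True = entailed by the clip, False = contradicted


def accuracy(predictions: List[bool], examples: List[VideoHypothesisPair]) -> float:
    """Fraction of examples whose binary label was predicted correctly."""
    correct = sum(p == ex.entailed for p, ex in zip(predictions, examples))
    return correct / len(examples)


# Toy examples (invented for illustration only).
examples = [
    VideoHypothesisPair("clip_0", ["Hi.", "Hello."], "Two people greet each other.", True),
    VideoHypothesisPair("clip_0", ["Hi.", "Hello."], "Nobody speaks in the clip.", False),
]
toy_accuracy = accuracy([True, True], examples)

# Figures quoted in the abstract.
num_pairs, num_clips, total_hours = 95_322, 15_887, 582
pairs_per_clip = num_pairs / num_clips
# "over 582 hours" makes this a lower bound on the mean clip length.
mean_clip_seconds = total_hours * 3600 / num_clips
```

With the quoted figures, each clip carries six hypotheses on average, and the mean clip length works out to a little over two minutes.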
Examining the Effects of Emotional Valence and Arousal on Takeover Performance in Conditionally Automated Driving
Du, Na, Zhou, Feng, Pulver, Elizabeth, Tilbury, Dawn M., Robert, Lionel P., Pradhan, Anuj K., Yang, X. Jessie
In conditionally automated driving, drivers have difficulty in takeover transitions as they become increasingly decoupled from the operational level of driving. Factors influencing takeover performance, such as takeover lead time and the engagement of non-driving related tasks, have been studied in the past. However, despite the important role emotions play in human-machine interaction and in manual driving, little is known about how emotions influence drivers' takeover performance. This study, therefore, examined the effects of emotional valence and arousal on drivers' takeover timeliness and quality in conditionally automated driving. We conducted a driving simulation experiment with 32 participants. Movie clips were played for emotion induction. Participants with different levels of emotional valence and arousal were required to take over control from automated driving, and their takeover time and quality were analyzed. Results indicate that positive valence led to better takeover quality in the form of a smaller maximum resulting acceleration and a smaller maximum resulting jerk. However, high arousal did not yield an advantage in takeover time. This study contributes to the literature by demonstrating how emotional valence and arousal affect takeover performance. The benefits of positive emotions carry over from manual driving to conditionally automated driving while the benefits of arousal do not.
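The abstract does not define its two takeover-quality measures. A minimal sketch under one common reading follows: resulting acceleration is taken as the magnitude of the combined longitudinal and lateral acceleration, and jerk as its finite-difference rate of change. The function names, the sampling interval, and the sample values are all assumptions for illustration, not the study's data or method.

```python
import math
from typing import List


def max_resultant(longitudinal: List[float], lateral: List[float]) -> float:
    """Largest magnitude of the combined longitudinal/lateral signal."""
    return max(math.hypot(a, b) for a, b in zip(longitudinal, lateral))


def finite_difference(series: List[float], dt: float) -> List[float]:
    """Approximate the time derivative of a uniformly sampled series."""
    return [(nxt - cur) / dt for cur, nxt in zip(series, series[1:])]


# Hypothetical acceleration samples (m/s^2) taken every dt seconds.
dt = 0.1
a_long = [0.0, -1.0, -3.0, -2.0]
a_lat = [0.0, 0.5, 4.0, 1.0]

max_acc = max_resultant(a_long, a_lat)

# Jerk (m/s^3) is the derivative of acceleration, per axis, then combined.
j_long = finite_difference(a_long, dt)
j_lat = finite_difference(a_lat, dt)
max_jerk = max_resultant(j_long, j_lat)
```

Under this reading, smaller values of both quantities indicate a smoother takeover manoeuvre.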
Can computers be trained to understand body language?
Humans are able to "read" others' body language for cues on their emotional state. For instance, we may notice that a friend is nervous from their tapping foot, or that a loved one who is standing tall feels confident. Now, a team of researchers at Penn State is exploring whether computers can be trained to do the same. The team is investigating whether modern computer vision techniques could match the cognitive ability of humans in recognizing bodily expressions in real-world, unconstrained situations. If so, these capabilities might allow for a large number of innovative applications in areas including information management and retrieval, public safety, patient care and social media, the researchers said.
Mid-Scale Shot Classification for Detecting Narrative Transitions in Movie Clips
Zhang, Bipeng (University of California Santa Cruz) | Jhala, Arnav (University of California Santa Cruz (UCSC))
This paper examines classification of shots in video streams for indexing and semantic analysis. We describe an approach to obtain shot motion by making use of motion estimation algorithms to estimate camera movement. We improve prior work by using the four edge regions of a frame to classify No Motion shots. We analyze a neighborhood of shots and provide a new concept, middle-scale classification. This approach relies on automated labeling of frame transitions in terms of motion across adjacent frames. These annotations form a sequential scene-group that correlates with narrative events in the videos. We introduce six middle-scale classes and the corresponding likely sequence content from three clips of the movie The Lord of the Rings: The Return of the King, and demonstrate that the middle-scale classification approach successfully extracts a summary of the salient aspects of the movie. We also show a direct comparison with prior work on the full movie Matrix.
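The abstract gives no details of the edge-region test, so the following is only a sketch of one plausible reading: if motion measured in the top, bottom, left and right border strips of a frame is negligible, the camera is treated as static even when an object moves near the centre. The motion-field representation, the `border` width, and the `threshold` value are assumptions, not the authors' parameters.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, float]       # (dx, dy) motion estimate for one block
MotionField = List[List[Vector]]   # rows x columns of motion vectors


def edge_regions(flow: MotionField, border: int) -> Dict[str, List[Vector]]:
    """Collect the motion vectors lying in the four border strips."""
    height, width = len(flow), len(flow[0])
    return {
        "top": [v for row in flow[:border] for v in row],
        "bottom": [v for row in flow[height - border:] for v in row],
        "left": [v for row in flow for v in row[:border]],
        "right": [v for row in flow for v in row[width - border:]],
    }


def mean_magnitude(vectors: List[Vector]) -> float:
    """Average length of the motion vectors in one region."""
    return sum(math.hypot(dx, dy) for dx, dy in vectors) / len(vectors)


def is_no_motion(flow: MotionField, border: int = 2, threshold: float = 0.5) -> bool:
    """Label a frame 'No Motion' when every edge region is nearly still."""
    regions = edge_regions(flow, border)
    return all(mean_magnitude(vs) < threshold for vs in regions.values())


# Static camera with an object moving in the centre of an 8x8 field.
static_camera = [[(0.0, 0.0)] * 8 for _ in range(8)]
static_camera[3][3] = (5.0, 0.0)
static_camera[4][4] = (5.0, 0.0)

# Horizontal pan: every block, including the edges, moves the same way.
panning_camera = [[(3.0, 0.0)] * 8 for _ in range(8)]

static_label = is_no_motion(static_camera)
pan_label = is_no_motion(panning_camera)
```

Here the centre-only motion leaves all four edge strips still, so the first field is labeled No Motion, while the pan moves the edges and is not.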